The primary aim of this report is to uncover interesting insights from the 2020 Semester 2 DATA2X02 class survey. Improvements for future surveys are also discussed. The main questions discussed are:
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(dplyr)
library(janitor)
library(skimr)
library(visdat)
library(gt)
library(kableExtra)
library(tibble)
library(gtsummary)
library(ggthemes)
library(reshape2)
library(ggpubr)
library(gridExtra)
library(plotly)
library(pwr)

A majority of the following code was taken from Dr Garth Tarr’s lecture examples; it converts the raw data’s variable names into names that are more manageable for analysis. Filtering for NA values and implausible responses is performed in later sections, when the variables are used for hypothesis testing.
raw = read_csv("DATA2X02 class survey 2020 (Responses) - Form responses 1.csv")
x = raw %>% clean_names()
colnames(x) = stringr::str_replace(string = colnames(x),
pattern = "what_is_your_",
replacement = "")
colnames(x) = stringr::str_replace(string = colnames(x),
pattern = "on_average_how_many_hours_per_week_did_you_",
replacement = "")
colnames(x)[2] = "covid_test"
colnames(x)[4] = "postcode"
colnames(x)[5] = "dentist"
colnames(x)[6] = "university_work"
colnames(x)[7] = "social_media"
colnames(x)[8] = "dog_or_cat"
colnames(x)[9] = "live_with_parents"
colnames(x)[10] = "exercising"
colnames(x)[12] = "asthma"
colnames(x)[13] = "paid_work"
colnames(x)[14] = "fav_season"
colnames(x)[16] = "height"
colnames(x)[17] = "floss_frequency"
colnames(x)[18] = "glasses"
colnames(x)[20] = "steak_preference"
colnames(x)[21] = "stress_level"
x = x %>% mutate(
postcode = as.character(postcode),
timestamp = lubridate::dmy_hms(timestamp)
)
#skim(x)
#vis_miss(x)

The data was collected from a survey which was made available to all DATA2x02 students for a couple of days. Out of ~572 students there were 174 responses. The survey was created by Dr Garth Tarr and hosted on Google Forms.
## [1] 10.34483
The general DATA2x02 cohort may not be comfortable with sharing their postcode for privacy and security reasons. Hence, a non-response bias is highly probable. Observing the relatively high percentage of missing values for the postcode question (~10%) , it can be deduced that such a bias occurs.
These variables are subject to social desirability bias as spending more time on university work, exercising and flossing is considered to be more socially acceptable and productive. Hence, surveyees may exaggerate their response and submit longer hours.
As no sampling technique removes all possible biases, this survey data cannot be considered a purely random sample of DATA2x02 students.
The survey was only available to students for a couple of days (a relatively short period of time). Hence, a sampling bias emerges: students who were ‘quick to act’ are over-represented relative to students who ‘procrastinated’ the survey.
On the other hand, the survey was posted on Ed, a forum easily accessible to all students, which helps reduce exclusion bias.
Overall, whilst this sample cannot truly be considered a ‘random’ sample, we can assert that the sampling techniques reduce the chances of biases.
## [1] 6 42 10 45 NA 12
The variable cannot be used to produce valid results as we are uncertain about the measurement unit used. As seen above, in most instances we cannot determine whether US, UK or European sizing was used. To generate useful data, the question should specify a measurement unit.
## [1] 160.00 1.78 178.00 175.00 NA
The question should specify the unit (either cm or m), as the responses currently mix centimetres and metres (as seen above). Whilst this data can be converted to a single unit, it is not ideal.
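One way to standardise the mixed responses is to treat any value below a small cut-off as metres. The sketch below makes that assumption; the cut-off of 3 is arbitrary but safe for human heights:

```r
# Values as printed above: a mix of centimetres and metres, plus an NA
height <- c(160.00, 1.78, 178.00, 175.00, NA)

# Assume anything below 3 was entered in metres; convert it to cm.
# NA values fail the condition and are passed through unchanged.
height_cm <- ifelse(!is.na(height) & height < 3, height * 100, height)
height_cm  # 160, 178, 178, 175, NA
```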
## [1] "Female" "Male" "Male" "Male" "Male" "Male" "Male" "male"
## [9] "Female" "Male" NA "female" "Male" "Male" NA "Female"
## [17] "Female" "Male" "Male" "Male"
Instead of taking free-text input, a drop-down menu containing several options should be used. This avoids the need to categorise each string into a gender and eliminates the risk of incorrectly parsing the data (due to unusual inputs).
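Until the question is changed, the free-text responses can at least be normalised before tabulation. A minimal sketch, using a subset of the responses printed above:

```r
# Free-text responses as printed above (a subset)
gender_raw <- c("Female", "Male", "male", NA, "female", "Male")

# Lower-case and trim whitespace so "male"/"Male" collapse to one level
gender <- tolower(trimws(gender_raw))
table(gender, useNA = "ifany")
```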
y = x %>%
filter(!is.na(covid_test)) %>%
group_by(covid_test) %>%
count()
n = sum(y$n)
#estimating the lambda parameter from the sample
l = sum(y$n * y$covid_test)/n
p = dpois(y$covid_test, lambda = l)
sample_counts = y$n
tests = y$covid_test
poisson_counts = n*p
df1 = data.frame(sample_counts, poisson_counts, tests )
df2 = melt(df1, id.vars="tests")
g1 = df2 %>%
ggplot() +
aes(x = tests, y=value, fill=variable) +
geom_bar(stat='identity', position='dodge') +
labs(x = "Number of COVID tests", y = "Counts", fill = "Sample",
caption = "Comparison between theoretical Poisson and sample distribution") +
scale_fill_manual(labels = c("Sample Counts", "Poisson Counts"), values = c("indianred2", "turquoise3"))+
theme_economist()
g1

When compared to theoretical counts derived from the Poisson distribution, the number of COVID tests does seem to follow the shape of a Poisson distribution. However, for some categories there are significant differences between the observed and theoretical counts, which may suggest that the variable isn’t Poisson distributed.
Hence to test this we formulate the following hypotheses and run a Chi-squared Goodness of Fit Test:
\(H_0\): The number of COVID tests follow a Poisson distribution
\(H_1\): The number of COVID tests do not follow a Poisson distribution
Set \(\alpha = 0.05\)
len = length(y$covid_test)
p[len] = 1 - sum(p[1:(len-1)])
res = chisq.test(y$n, p=p)
ey = res$expected
res$expected
## [1] 9.843092e+01 5.493819e+01 1.533159e+01 2.852388e+00 3.980077e-01
## [6] 4.442876e-02 4.132908e-03 3.540346e-04
We observe above that several cells have expected counts less than 5, violating the assumption of the test. Hence, we group the last 6 cells into one, giving the expected counts seen below.
grp = 2
y = y$n
yr = c(y[1:grp], sum(y[(grp+1):len]))
pr = c(p[1:grp], sum(p[(grp+1):len]))
lenr = length(yr)
eyr = c(ey[1:grp], sum(ey[(grp+1):len]))
eyr
## [1] 98.43092 54.93819 18.63090
\(T = \sum_{i = 0}^{2} \frac{(Y_i - np_i)^2}{np_i}\), under \(H_0, T\) ~ \(\chi_1^2\)
## [1] 3
Hence the degrees of freedom are \(3-1-1 = 1\), as after grouping we have 3 categories and the data was also used to estimate the \(\lambda\) parameter.
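The grouped statistic and p-value can also be checked by hand. In the sketch below the expected counts match those printed above, but the observed counts are hypothetical stand-ins, since the grouped observed counts are not reproduced here:

```r
# Expected counts after grouping (as printed above)
eyr <- c(98.43092, 54.93819, 18.63090)

# Hypothetical observed counts, purely to illustrate the mechanics
obs <- c(120, 40, 12)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
t_stat <- sum((obs - eyr)^2 / eyr)

# p-value with df = 3 - 1 - 1 (one category constraint, one estimated lambda)
p_val <- pchisq(t_stat, df = 1, lower.tail = FALSE)
```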
res = chisq.test(yr, p = pr)
# res$expected
# res$parameter
p_val = pchisq(res$statistic, df = 1, lower.tail = FALSE)
p_val
## X-squared
## 9.336178e-06
Since the observed p-value of \(9.33\times10^{-6}\) is less than our alpha, we reject the null hypothesis; hence concluding that the number of COVID tests does not follow a Poisson distribution.
This hypothesis aims to determine if students with glasses or contacts spend more time on university work than their counterparts.
Rationale behind this test
On the assumption that smarter students spend more time studying, this test aims to determine if the sample aligns with the results of Williams, Katie M et al. in their report “Phenotypic and genotypic correlation between myopia and intelligence”. They conclude that whilst there may appear to be a significant relationship between myopia and IQ (in the highest IQ quartile), 78% of it was explained by genetic effects.
Let \(\mu_x\) be the average number of hours spent on university work by students with glasses or contacts (\(X_i's\)).
Let \(\mu_y\) be the average number of hours spent on university work by students without glasses or contacts (\(Y_i's\)).
Exploratory analysis suggests that there may not be a significant difference in the means of all the \(X_i's\) and \(Y_i's\), as seen in the boxplot below. On average both groups of students spend around ~27 hours per week on university work (Table 5.1).
glasses = x %>%
filter(glasses == "Yes") %>%
dplyr::select(university_work) %>%
filter(!is.na(university_work)) %>%
filter(university_work < 80)
no_glasses = x %>%
filter(glasses == "No") %>%
dplyr::select(university_work) %>%
filter(!is.na(university_work)) %>%
filter(university_work < 80)
b1 = x %>%
filter(university_work < 80) %>%
filter(!is.na(university_work)) %>%
filter(!is.na(glasses)) %>%
ggplot() +
aes(x = glasses, y = university_work) +
geom_boxplot() +
geom_jitter(width=0.15, size = 1.5, colour = "blue") +
labs(x = "Wears glasses", y = "Hours spent on university work") +
theme_economist()
ggplotly(b1)

summary = matrix(c(mean(glasses$university_work), sd(glasses$university_work),
mean(no_glasses$university_work), sd(no_glasses$university_work)), ncol = 2, byrow=TRUE)
colnames(summary) = c("Mean (hrs)", "SD (hrs)")
rownames(summary) = c("Yes", "No")
kableExtra::kable(summary, caption = "Mean and SD summaries") %>%
kable_styling(bootstrap_options = c("hover", "striped", "condensed")) %>%
kable_classic_2()

|     | Mean (hrs) | SD (hrs) |
|---|---|---|
| Yes | 27.71739 | 14.65600 |
| No  | 27.37500 | 14.11466 |
We begin by establishing the hypotheses and running a Two-Sample t-test to test for the difference in the means.
\(H_0\): \(\mu_x = \mu_y\)
\(H_1\): \(\mu_x > \mu_y\)
Set \(\alpha = 0.05\)
All \(X_i's\) are iid \(N(\mu_x, \sigma^2)\). Observing the qqplot below, we can assume that the \(X_i's\) are approximately normal.
All \(Y_i's\) are iid \(N(\mu_y, \sigma^2)\). Observing the qqplot below, we can assume that the \(Y_i's\) are approximately normal.
q1 = ggqqplot(glasses$university_work) +
labs(x = "Theoretical quantiles from a normal distribution",
y = "Quantiles from sample",
caption = "For X_i sample") +
theme_economist()
q2 = ggqqplot(no_glasses$university_work) +
labs(x = "Theoretical quantiles from a normal distribution",
y = "Quantiles from sample",
caption = "For Y_i sample") +
theme_economist()
grid.arrange(q1, q2, ncol = 2)

Since both samples have a standard deviation of ~14 hours (Table 5.1), we can reasonably assume that the \(X_i's\) and \(Y_i's\) have equal variances.
Given that each student completed the survey only once, it is safe to assume that all \(X_i's\) are independent of all \(Y_i's\).
\(T = \frac{\bar{X} - \bar{Y}}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\), where \(S_p^2 = \frac{(n_x - 1)S_x^2 + (n_y - 1)S_y^2}{n_x + n_y - 2}\), with \(T\) ~ \(t_{162}\)
Degrees of freedom calculation shown below.
## [1] 162
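As a sanity check, the pooled statistic can be computed by hand from the Table 5.1 summaries. The individual group sizes are not reported above, only their sum (df = 162, so \(n_x + n_y = 164\)); the split used below is an assumption made for illustration:

```r
# Group summaries from Table 5.1 (x = glasses, y = no glasses)
m_x <- 27.71739; s_x <- 14.65600
m_y <- 27.37500; s_y <- 14.11466

# Assumed group sizes: only n_x + n_y = 164 is known from df = 162
n_x <- 92; n_y <- 72

# Pooled variance and t statistic, matching the formulas above
sp2    <- ((n_x - 1) * s_x^2 + (n_y - 1) * s_y^2) / (n_x + n_y - 2)
t_stat <- (m_x - m_y) / (sqrt(sp2) * sqrt(1 / n_x + 1 / n_y))
```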
t_res = t.test(glasses$university_work, no_glasses$university_work,
alternative="greater", conf.level = 0.95, var.equal = TRUE)
t_res
##
## Two Sample t-test
##
## data: glasses$university_work and no_glasses$university_work
## t = 0.15089, df = 162, p-value = 0.4401
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -3.411507 Inf
## sample estimates:
## mean of x mean of y
## 27.71739 27.37500
Since the observed p-value of \(0.4401\) is larger than our alpha (0.05), we do not reject the null hypothesis. Additionally, we can confirm this by observing that a mean difference of 0 lies within the 95% confidence interval.
Hence, we conclude that there is not a significant difference between the hours spent on university work by students with glasses or contacts and those without. This aligns with the results from the study conducted by Williams, Katie M et al.
Overall, many interesting conclusions were drawn through the analysis of the survey data; some of them confirming existing ideas whilst others provided new insights.
3.4.4 Social Media - What is your favourite social media platform?
Similar reasoning applies as for the gender question: respondents may misspell platform names or list more than one platform (as seen above).
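A cleaning sketch for such free-text responses, with made-up values and a hand-written alias map (both are assumptions, not the survey’s actual responses):

```r
# Made-up free-text responses illustrating the problem
platform_raw <- c("Instagram", "instagram", "insta", "TikTok", "tik tok", NA)

# Lower-case and trim, then collapse known aliases onto canonical names
platform <- tolower(trimws(platform_raw))
platform[platform %in% c("insta", "instagram")] <- "instagram"
platform[platform %in% c("tik tok", "tiktok")] <- "tiktok"
table(platform, useNA = "ifany")
```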